Objective
This vignette describes how copy number variation (CNV) can
be identified using Mosaic. Data filtering, normalization and
ploidy computation are described.
The h5 file used in this notebook can be found here
import missionbio.mosaic.io as mio
sample = mio.load('./sample.analyzed.h5')
This is an analyzed h5 file. Hence the clones and clusters
are already labeled. It contains three cell lines KG-1,
Tom-1, and Jurkat. It also contains doublets of each pair
as seen in the heatmap.
For your h5 file, identify the clones using the DNA variants
before proceeding with this vignette.
sample.dna.heatmap('NGT')
We will only be working with the DNA and CNV assays.
The CNV object contains 138 amplicons. But not all
of them worked as expected. We will drop amplicons
which worked in less than half the total cells.
# This returns the reads in each cell and amplicon
reads = sample.cnv.get_attribute('read_counts', constraint='row+col')
reads
# Only amplicons found in more than half the cells are analyzed
# The other amplicons are dropped.
working_amplicons = (reads.median() > 0).values
sample.cnv = sample.cnv[:, working_amplicons]
# We are now left with 135 amplicons
sample.cnv.shape
Finding the ploidy of the cells can be split into
two distinct calculations.
Normalization
This step is required to correct for systemic artefacts.
Cell to cell, and amplicon to amplicon variations are corrected.
Each cell is normalized to it's total reads and each amplicon
to its median counts. This is performed using the normalize_reads
method, which adds the normalized_counts layer to the CNV object.
Correcting the baseline
Not all amplicons show copy loss or copy gain. In the normalization
step all amplicons in a cell were normalized using the same value
i.e. the total reads in the cell. This results in slight deviations
from the true ploidy value of the cell for each amplicon.
To correct for this, a clone which is known to be diploid is identified
and all the other cells are normalized to the counts of this clone. This
step is done using the compute_ploidy method, and providing it
with the barcodes known to be diploid. This adds the ploidy layer
to the CNV object.
# Reads are normalized to correct for systemic artefacts
sample.cnv.normalize_reads()
# Assuming Jurkat cells are diploid for all the amplicons,
# we can compute the ploidy of the other cell lines as follows
sample.cnv.compute_ploidy(diploid_cells=sample.dna.barcodes('Jurkat'))
The ploidy values are stored in the ploidy layer
These can be visualized using a heatmap or a ploidy line plot for each clone.
# Assign the DNA labels to the CNV object
# We want to ensure the labels are the same
sample.cnv.set_labels(sample.dna.get_labels())
sample.cnv.set_palette(sample.dna.get_palette())
# Heatmap with the features ordered by the default amplicon order
sample.cnv.heatmap('ploidy', features=sample.cnv.ids())
As seen in the heatmap, KG-1 has a copy loss for some
of the amplicons in the panel. TOM-1 does not show much
deviation from ploidy 2.
This can also be visualized by the per amplicon ploidy plot.
This plot is generated for each clone separately.
# KG-1 shows copy loss at two different locations
sample.cnv.plot_ploidy('KG-1')
# TOM-1 is diploid for all the amplicons
sample.cnv.plot_ploidy('TOM-1')
# Since Jurkat was used to set the baseline,
# it is perfectly diploid for all amplicons
sample.cnv.plot_ploidy('Jurkat')
Often copy loss or copy gain will be seen in a set of amplicons.
These amplicons might belong to the same gene. Cross checking the
amplicon positions with the gene positions can be useful.
Multiple amplicons showing a CNV makes the occurence much more
likely to be a true positive observation.